Data presentation and exploration

Source and goal of the analysis

The data for this analysis can be found here.

The goal of the analysis is to model the growth of the fish as a function of age and water temperature.

Data collection

The fish are kept in tanks at 25, 27, 29 and 31 degrees Celsius.

Starting from birth, every 14 days a specimen is chosen at random and its length is measured. There are 44 rows of data.

The data include:

  • the age of the fish in days;
  • the water temperature in degrees Celsius;
  • the length of the fish in decimillimeters (tenths of a millimeter).
age  temperature  length
 14           25     620
 28           25    1315
 41           25    2120
 55           25    2600
 69           25    3110
 83           25    3535
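The six rows shown above (the 25 °C tank) are enough for a quick sanity check of the units; a minimal sketch in Python (the report's own code is in R), converting the length column from decimillimeters to millimeters:

```python
import numpy as np

# First six rows of the dataset (25 °C tank), as shown above.
# Columns: age (days), temperature (°C), length (decimillimeters).
rows = np.array([
    [14, 25,  620],
    [28, 25, 1315],
    [41, 25, 2120],
    [55, 25, 2600],
    [69, 25, 3110],
    [83, 25, 3535],
], dtype=float)

age, temp, length_dmm = rows[:, 0], rows[:, 1], rows[:, 2]
length_mm = length_dmm / 10.0  # 1 decimillimeter = 0.1 mm

print(length_mm)  # lengths in millimeters
```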

Basic graphical analysis

From the graphs we can already see something interesting.

The growth function of the fish seems well behaved, so hopefully we can estimate it properly.

The temperature seems to have a clear effect on our fish: it does little up to a critical value, but once it reaches 31 degrees Celsius it has an evident impact, reducing the length of the fish.

Are we frequentist?…

Linear regression on all the features

In a frequentist framework, the dumbest thing we can do is to throw a linear regression at everything and see what happens.

From now on we will write:

  • \(z\) for the length;
  • \(x\) for the age;
  • \(y\) for the temperature.

This model is given by:

\[z= ax + by +c\]

## 
## Call:
## lm(formula = z ~ x + y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1406.27  -398.59    30.73   313.94  1291.12 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3904.266   1149.044   3.398  0.00152 ** 
## x             26.241      2.055  12.769 7.11e-16 ***
## y           -106.414     40.452  -2.631  0.01195 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 600 on 41 degrees of freedom
## Multiple R-squared:  0.8056, Adjusted R-squared:  0.7962 
## F-statistic: 84.98 on 2 and 41 DF,  p-value: 2.607e-15

This model gives us a multiple R-squared of \(80\)%: not bad, but we can surely do better.
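The plane \(z = ax + by + c\) is an ordinary least-squares problem, so the machinery behind the `lm()` summary above can be reproduced by hand. A sketch in Python on synthetic data generated from the reported coefficients (the full dataset is not reproduced here, so this only illustrates the method, not the report's actual fit):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the coefficients reported by lm() above
# (x: 26.241, y: -106.414, intercept: 3904.266).
n = 200
x = rng.uniform(14, 160, n)          # age in days
y = rng.choice([25, 27, 29, 31], n)  # tank temperature
z = 26.241 * x - 106.414 * y + 3904.266 + rng.normal(0, 50, n)

# Design matrix [x, y, 1]; ordinary least squares solves z ≈ X @ beta.
X = np.column_stack([x, y, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)
a_hat, b_hat, c_hat = beta
print(a_hat, b_hat)  # close to 26.241 and -106.414
```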

Let’s take a look at the fitted plane:

As we could expect, a flat plane is not capable of recovering the curvature of the function; moreover, this model cannot capture the influence of the temperature.

Polynomial regression: temperature not taken into account

As we can read in this study by Johannes Hamre, Espen Johnsen and Kristin Hamre, the growth function of many fish species can be estimated with a second-order polynomial fit. So let's try to fit one (for the moment we leave out the temperature factor).

\[z = a + bx +cx^2\]

## 
## Call:
## lm(formula = df$length ~ poly(df$age, 2, raw = T), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1099.77   -81.28    75.85   320.95   741.60 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -364.91117  255.59918  -1.428    0.161    
## poly(df$age, 2, raw = T)1   69.04526    7.05444   9.787 2.75e-12 ***
## poly(df$age, 2, raw = T)2   -0.25642    0.04117  -6.229 2.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 465 on 41 degrees of freedom
## Multiple R-squared:  0.8833, Adjusted R-squared:  0.8776 
## F-statistic: 155.1 on 2 and 41 DF,  p-value: < 2.2e-16

We reach an R-squared of about \(88\)%, so a second-degree polynomial is a good way to estimate our function.
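The quadratic fit itself needs no special machinery; a sketch in Python using `np.polyfit` on the six 25 °C rows shown in the data section (so the coefficients will differ somewhat from the full-data R summary above):

```python
import numpy as np

# Six 25 °C rows from the data section: age (days), length (dmm).
age    = np.array([14, 28, 41, 55, 69, 83], dtype=float)
length = np.array([620, 1315, 2120, 2600, 3110, 3535], dtype=float)

# Second-degree polynomial z = c2*x^2 + c1*x + c0
# (np.polyfit returns the highest-degree coefficient first).
c2, c1, c0 = np.polyfit(age, length, 2)

# Growth decelerates with age, so the curvature term is negative.
print(c2 < 0)  # → True
```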

Obviously this fit is biased towards the colder tanks, since we have three cold tanks against a single hot one.

Polynomial regression: Temperature models

To bring temperature into our model, we can start by fitting this polynomial regression separately on the two groups and analysing the results.

Summaries

General
## 
## Call:
## lm(formula = df$length ~ poly(df$age, 2, raw = T), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1099.77   -81.28    75.85   320.95   741.60 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -364.91117  255.59918  -1.428    0.161    
## poly(df$age, 2, raw = T)1   69.04526    7.05444   9.787 2.75e-12 ***
## poly(df$age, 2, raw = T)2   -0.25642    0.04117  -6.229 2.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 465 on 41 degrees of freedom
## Multiple R-squared:  0.8833, Adjusted R-squared:  0.8776 
## F-statistic: 155.1 on 2 and 41 DF,  p-value: < 2.2e-16
Hot
## 
## Call:
## lm(formula = df$length ~ poly(df$age, 2, raw = T), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -181.62  -69.91  -59.86  108.31  153.85 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -43.50419  137.33227  -0.317     0.76    
## poly(df$age, 2, raw = T)1  52.56762    3.79032  13.869 7.07e-07 ***
## poly(df$age, 2, raw = T)2  -0.20858    0.02212  -9.430 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124.9 on 8 degrees of freedom
## Multiple R-squared:  0.9848, Adjusted R-squared:  0.981 
## F-statistic: 259.6 on 2 and 8 DF,  p-value: 5.303e-08
Cold
## 
## Call:
## lm(formula = df$length ~ poly(df$age, 2, raw = T), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -303.26  -64.45    1.38   74.18  545.68 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -472.04682  107.12264  -4.407 0.000124 ***
## poly(df$age, 2, raw = T)1   74.53781    2.95654  25.211  < 2e-16 ***
## poly(df$age, 2, raw = T)2   -0.27237    0.01725 -15.786 4.48e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 168.8 on 30 degrees of freedom
## Multiple R-squared:  0.9862, Adjusted R-squared:  0.9853 
## F-statistic:  1073 on 2 and 30 DF,  p-value: < 2.2e-16

Fixing the temperature (or better: stating whether the temperature is under or over the critical level), our R-squared jumps to \(99\)%, and the function becomes really stable.

Parameter estimates

                           model.coefficients  hot_model.coefficients  cold_model.coefficients
(Intercept)                      -364.9111654             -43.5041889             -472.0468243
poly(df$age, 2, raw = T)1          69.0452649              52.5676185               74.5378137
poly(df$age, 2, raw = T)2          -0.2564195              -0.2085775               -0.2723669

Confidence intervals

General

                                  2.5 %        97.5 %
(Intercept)                -881.1041862   151.2818554
poly(df$age, 2, raw = T)1    54.7985339    83.2919959
poly(df$age, 2, raw = T)2    -0.3395608    -0.1732782

Hot

                                  2.5 %        97.5 %
(Intercept)                -360.1929602   273.1845824
poly(df$age, 2, raw = T)1    43.8271289    61.3081081
poly(df$age, 2, raw = T)2    -0.2595854    -0.1575696

Cold

                                  2.5 %        97.5 %
(Intercept)                -690.8204320  -253.2732166
poly(df$age, 2, raw = T)1    68.4997454    80.5758821
poly(df$age, 2, raw = T)2    -0.3076039    -0.2371298

Looking at the parameter estimates and at the confidence intervals, we gain further evidence that the two models are indeed different.
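These intervals are just estimate ± t-quantile × standard error. A quick sanity check in Python on the cold model's linear term (estimate 74.53781, SE 2.95654, 30 residual degrees of freedom; the critical value is hardcoded here to avoid a scipy dependency):

```python
# 95% CI for the cold model's first-degree coefficient, by hand.
estimate = 74.53781   # poly(df$age, 2, raw = T)1, cold model
se       = 2.95654    # its standard error
t_crit   = 2.042272   # qt(0.975, df = 30), hardcoded

lo = estimate - t_crit * se
hi = estimate + t_crit * se
print(lo, hi)  # ≈ 68.4997 and 80.5759, matching the table
```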

To build a model that incorporates the temperatures into a single function, let's be Bayesian:

…Or are we Bayesian?

Simple model (no temperature)

Let's first construct the same simple model as before, a plain polynomial regression, and try to recover the parameters.
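The model file itself is not shown in the report; a plausible reconstruction of what fish_model_simple.txt might contain, matching the four unobserved stochastic nodes (a, b, c, sigma) in the output below — the specific priors here are assumptions:

```
model {
  for (i in 1:N) {
    mu[i] <- a + b * x[i] + c * pow(x[i], 2)
    z[i] ~ dnorm(mu[i], tau)
  }
  # Vague priors (assumed; the actual choices are not shown in the report)
  a ~ dnorm(0, 1.0E-6)
  b ~ dnorm(0, 1.0E-6)
  c ~ dnorm(0, 1.0E-6)
  sigma ~ dunif(0, 1000)
  tau <- 1 / pow(sigma, 2)
}
```

Note that in JAGS/BUGS, dnorm is parameterized by precision tau = 1/sigma², not by standard deviation.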

## module glm loaded
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 44
##    Unobserved stochastic nodes: 4
##    Total graph size: 146
## 
## Initializing model
## Inference for Bugs model at "fish_model_simple.txt", fit using jags,
##  2 chains, each with 5e+05 iterations (first 10000 discarded), n.thin = 490
##  n.sims = 2000 iterations saved
##            mean    sd   2.5%    25%    50%    75% 97.5% Rhat n.eff
## a        -358.5 249.4 -847.8 -525.7 -359.6 -188.6 131.6    1   410
## b          68.9   7.1   55.1   64.2   68.9   73.6  82.8    1   770
## c          -0.3   0.0   -0.3   -0.3   -0.3   -0.2  -0.2    1   900
## deviance  666.4   2.9  662.8  664.4  665.7  667.7 673.4    1  1700
## sigma     472.7  52.9  381.6  435.8  467.3  505.7 579.6    1  2000
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 4.2 and DIC = 670.6
## DIC is an estimate of expected predictive error (lower deviance is better).

To evaluate the goodness of the model we can use the DIC.

The deviance information criterion is a hierarchical modeling generalization of the Akaike information criterion. It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. DIC is an asymptotic approximation as the sample size becomes large, like AIC. It is only valid when the posterior distribution is approximately multivariate normal.

How to calculate it?

Define the deviance as \({\displaystyle D(\theta )=-2\log(p(y|\theta ))+C\,}\), where \(y\) are the data, \(\theta\) are the unknown parameters of the model and \(p(y|\theta )\) is the likelihood function. \(C\) is a constant that cancels out in all calculations that compare different models, and which therefore does not need to be known.

To calculate the effective number of parameters of the model, as described in Gelman(2004, p. 182) we use \(p_{D}=p_{V}=\frac{1}{2} \hat{var}(D(\theta))\). The larger the effective number of parameters is, the easier it is for the model to fit the data, and so the deviance needs to be penalized.

The deviance information criterion is calculated as

\({\mathit {DIC}}=p_{D}+{\bar {D}}\), or equivalently as

\({\mathit {DIC}}=D({\bar {\theta }})+2p_{D}\).
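Given the deviance samples from the MCMC output, both quantities are one-liners. A sketch in Python on a synthetic vector of deviance draws, using the pD = var(deviance)/2 rule quoted in the JAGS output:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic deviance draws, stand-ins for the saved MCMC samples
# (the simple model above reports mean deviance ≈ 666.4, sd ≈ 2.9).
deviance = rng.normal(666.4, 2.9, 2000)

d_bar = deviance.mean()             # posterior mean deviance, D-bar
p_d   = deviance.var(ddof=1) / 2.0  # effective number of parameters
dic   = d_bar + p_d                 # DIC = D-bar + pD

print(round(dic, 1))  # ≈ 670.6, as in the JAGS output
```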

We get a DIC of 670. On its own it's just a raw number, but it will be useful to compare this model with the next one, where we introduce the temperature.

Let's take a look at the usual MCMC diagnostics to assess the performance.

Bugs output

Density

The density plots are approximately normal-shaped (as we would like them to be) and are very similar between the two chains.

Autocorrelation

Pretty good: with this large a sample, we get almost uncorrelated chains.

Fit

The model seems capable of recovering realistic parameters; however, we still have the same problem as in the frequentist analysis: we want to encode the temperature in a single model. Let's try to do it.

Complex model (with temperature)

To introduce the temperature, the proposal is to add two parameters that modify the coefficients of the polynomial when the water is too hot (encoding the temperature as a binary indicator \(h\), with \(h = 1\) for the hot tank):

\[z = a + (b-\beta h)\,x + (c-\gamma h)\,x^2\]
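Once \(h\) is a 0/1 indicator, this model is still linear in its parameters, with design columns \([1, x, hx, x^2, hx^2]\). A sketch in Python on synthetic data to show that the parameters are recoverable (the true values below are illustrative, loosely inspired by the posterior means, not taken from the report's data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from z = a + (b - beta*h)*x + (c - gamma*h)*x^2,
# with h = 1 for the hot tank and 0 otherwise. Illustrative values.
a, b, beta, c, gamma = -369.0, 72.0, 11.0, -0.26, 0.05
n = 400
x = rng.uniform(14, 160, n)
h = rng.integers(0, 2, n).astype(float)
z = a + (b - beta * h) * x + (c - gamma * h) * x**2 + rng.normal(0, 50, n)

# Linear in the parameters: columns [1, x, h*x, x^2, h*x^2]
# estimate (a, b, -beta, c, -gamma).
X = np.column_stack([np.ones(n), x, h * x, x**2, h * x**2])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)
a_hat, b_hat, beta_hat, c_hat, gamma_hat = (
    coef[0], coef[1], -coef[2], coef[3], -coef[4])
print(b_hat, beta_hat)  # close to 72 and 11
```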
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 44
##    Unobserved stochastic nodes: 6
##    Total graph size: 234
## 
## Initializing model
## Inference for Bugs model at "fish_model_complex.txt", fit using jags,
##  2 chains, each with 5e+05 iterations (first 10000 discarded), n.thin = 490
##  n.sims = 2000 iterations saved
##            mean   sd   2.5%    25%    50%    75%  97.5% Rhat n.eff
## a        -369.1 93.5 -554.1 -430.0 -369.3 -311.3 -181.9    1  2000
## b          72.0  2.6   66.6   70.3   72.0   73.7   77.1    1  2000
## beta       11.2  2.6    6.0    9.4   11.3   12.9   16.0    1  2000
## c          -0.3  0.0   -0.3   -0.3   -0.3   -0.2   -0.2    1  2000
## deviance  576.7  3.8  571.7  573.9  575.9  578.7  585.7    1  2000
## gamma       0.0  0.0    0.0    0.0    0.0    0.0    0.0    1  2000
## sigma     170.9 19.9  137.8  156.7  169.0  183.6  214.4    1  2000
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 7.1 and DIC = 583.8
## DIC is an estimate of expected predictive error (lower deviance is better).

We get a lower DIC, so this model is better.

Bugs output

Density

The density plots are approximately normal-shaped (as we would like them to be) and are very similar between the two chains.

Autocorrelation

Pretty good: with this large a sample, we get almost uncorrelated chains.

Fit

Conclusions

After our analysis we can state that, according to our data, temperature is a relevant factor in the growth of fish, and going further with this study could help us understand how global warming may affect marine animal populations.

However, the sample in this work was too small to draw strong conclusions; for example, the fact that we have only one heavily warmed tank could be a source of bias.

To get more consistent results we should use larger samples.